The problem I will be addressing is whether factors such as private/public status, number of applications, number of students accepted, number of enrollments, high school class rank, out-of-state tuition, room and board, book costs, percentage of professors with PhDs, student-to-faculty ratio, and expenditure per student can predict a university's graduation rate.
University administrators would be interested to know which factors affect whether students graduate from their university and how they might increase the number who do.
This approach will determine which of the above variables have a significant impact, build multiple models to predict university graduation rates, and calculate the accuracy of each model to determine the best one to use.
Download Location: https://www.kaggle.com/yashgpt/us-college-data
library(e1071)
library(dplyr)
library(caret)
library(purrr)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(tidyverse)
library(randomForest)
college_data <- read.csv("College_Data.csv")
head(college_data)
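# Encode Private as an integer: "No" -> 1, "Yes" -> 2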
college_data$Private <- as.integer(ifelse(college_data$Private=="No", 1, 2))
# Round S.F.Ratio and convert to integer
college_data$S.F.Ratio <- round(college_data$S.F.Ratio)
college_data$S.F.Ratio <- as.integer(college_data$S.F.Ratio)
rownames(college_data) <- college_data$Uni.Name
college_data <- within(college_data, rm(Uni.Name))
options(repr.plot.width=16, repr.plot.height=12)
par(mfrow=c(2,3))
for(i in colnames(college_data)){
  variable <- as.numeric(unlist(college_data[i]))
  collegeHists <- hist(variable, main = paste(i, "Histogram", sep=" "), xlab = i)
  # Label each bar with its count
  text(collegeHists$mids, collegeHists$counts, labels=collegeHists$counts, adj=c(0.5, -0.5))
  # Overlay a normal curve scaled to the histogram counts
  xfit <- seq(min(variable), max(variable), length.out = nrow(college_data))
  yfit <- dnorm(xfit, mean = mean(variable), sd = sd(variable))
  yfit <- yfit * diff(collegeHists$mids[1:2]) * length(variable)
  lines(xfit, yfit, col = "blue", lwd = 2)
}
# Initialize data frame
Value_MinMax <- as.data.frame(cbind("temp" = c(0, 1)))
# For each variable, compute the bounds three standard deviations
# above and below the mean; values outside will be treated as outliers
for(i in colnames(college_data)){
  variable <- as.numeric(unlist(college_data[i]))
  Value_MinMax <- cbind(Value_MinMax,
                        i = c(round(mean(variable) + sd(variable) * 3),
                              round(mean(variable) - sd(variable) * 3)))
}
# Remove temp column for data frame initialization
Value_MinMax <- subset(Value_MinMax, select=-c(temp))
# Apply proper column names for the variables
colnames(Value_MinMax) <- colnames(college_data)
# Apply row names
rownames(Value_MinMax) <- c("Max Accepted Value", "Min Accepted Value")
Value_MinMax
clean_data <- college_data %>% rownames_to_column('uni') %>%
filter(Private %in% (Value_MinMax[[1]][1]:Value_MinMax[[1]][2]),
Apps %in% (Value_MinMax[[2]][1]:Value_MinMax[[2]][2]),
Accept %in% (Value_MinMax[[3]][1]:Value_MinMax[[3]][2]),
Enroll %in% (Value_MinMax[[4]][1]:Value_MinMax[[4]][2]),
Top10perc %in% (Value_MinMax[[5]][1]:Value_MinMax[[5]][2]),
Top25perc %in% (Value_MinMax[[6]][1]:Value_MinMax[[6]][2]),
F.Undergrad %in% (Value_MinMax[[7]][1]:Value_MinMax[[7]][2]),
P.Undergrad %in% (Value_MinMax[[8]][1]:Value_MinMax[[8]][2]),
Outstate %in% (Value_MinMax[[9]][1]:Value_MinMax[[9]][2]),
Room.Board %in% (Value_MinMax[[10]][1]:Value_MinMax[[10]][2]),
Books %in% (Value_MinMax[[11]][1]:Value_MinMax[[11]][2]),
Personal %in% (Value_MinMax[[12]][1]:Value_MinMax[[12]][2]),
PhD %in% (Value_MinMax[[13]][1]:Value_MinMax[[13]][2]),
Terminal %in% (Value_MinMax[[14]][1]:Value_MinMax[[14]][2]),
S.F.Ratio %in% (Value_MinMax[[15]][1]:Value_MinMax[[15]][2]),
perc.alumni %in% (Value_MinMax[[16]][1]:Value_MinMax[[16]][2]),
Expend %in% (Value_MinMax[[17]][1]:Value_MinMax[[17]][2]),
Grad.Rate %in% (Value_MinMax[[18]][1]:Value_MinMax[[18]][2])) %>%
column_to_rownames('uni')
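As an aside, this +/- 3 standard deviation filter can be written far more compactly. Below is a minimal sketch, assuming dplyr >= 1.0 for if_all(); within_3sd is a helper I introduce here, and because the bounds are not rounded to integers (and the row-name bookkeeping is skipped), its result can differ slightly from the filter above.
# Keep only rows where every column lies within 3 standard deviations of its mean
within_3sd <- function(x) x >= mean(x) - 3 * sd(x) & x <= mean(x) + 3 * sd(x)
clean_data_alt <- college_data %>% filter(if_all(everything(), within_3sd))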
# Structure of cleaned dataset
paste0('Cleaned Dataset Structure')
str(clean_data)
# Cleaned Dataset
paste0('Cleaned Dataset Head')
head(clean_data)
options(repr.plot.width=16, repr.plot.height=12)
par(mfrow=c(2,3))
for(i in colnames(clean_data)){
  variable <- as.numeric(unlist(clean_data[i]))
  cleanHists <- hist(variable, main = paste(i, "Histogram", sep=" "), xlab = i)
  # Label each bar with its count
  text(cleanHists$mids, cleanHists$counts, labels=cleanHists$counts, adj=c(0.5, -0.5))
  # Overlay a normal curve scaled to the histogram counts
  xfit <- seq(min(variable), max(variable), length.out = nrow(clean_data))
  yfit <- dnorm(xfit, mean = mean(variable), sd = sd(variable))
  yfit <- yfit * diff(cleanHists$mids[1:2]) * length(variable)
  lines(xfit, yfit, col = "blue", lwd = 2)
}
We can also see that of the original 777 records, only 92 were removed, leaving us with 685 records for this project's analysis.
paste0(nrow(college_data) - nrow(clean_data), ' Records Removed')
Now that we have the final dataset, we can dive deeper into how the variables correlate with each other. The best way I could think to do this was to create a correlation heatmap of all of the variables in the dataset to see which ones have the highest correlation with the graduation rate (Grad.Rate).
# Function to reorder a correlation matrix by hierarchical clustering
reorder_cormat <- function(cormat){
  # Use (1 - correlation) / 2 as the distance between variables
  dd <- as.dist((1 - cormat) / 2)
  hc <- hclust(dd)
  cormat[hc$order, hc$order]
}
cormat <- round(cor(clean_data), 2)
# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
# Melt the correlation matrix
melted_cormat <- melt(cormat)
# Create the correlation matrix heatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() + # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 15, hjust = 1)) +
coord_fixed()
options(repr.plot.width=14, repr.plot.height=14)
par(mfrow=c(1,1))
# Add title and correlation values to squares
ggheatmap + labs(title = "Correlation Matrix Heatmap") +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
text = element_text(size = 20),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank())
Looking at the heatmap, we can see that the variable with the highest correlation with Grad.Rate is Outstate at 0.54, with perc.alumni coming in close at 0.48. Neither of these is a strong correlation with Grad.Rate, and the rest of the variables fare far worse.
Before making any more assumptions, I decided to run linear models of Grad.Rate against each of the other variables to determine their R^2 values and calculate the RMSE. These values, combined with the heatmap, will give me a better idea of how the variables relate to Grad.Rate and whether any of them are significant.
Calculate the RMSE and R^2 values and store them in a data frame for easy plotting.
# RMSE calculation Function
RMSE <- function(error) { sqrt(mean(error^2)) }
# Initialize Dataframe to hold R^2 and RMSE values
RRMSE_df <- as.data.frame(rbind("temp" = c(0, 1, 2, 4)))
# All variable names except the response itself
nameList <- setdiff(names(clean_data), "Grad.Rate")
# Loop to fit a one-variable linear model for each variable, with Grad.Rate as the dependent variable
for (x in nameList){
  # Create the model Grad.Rate ~ x
  model <- lm(reformulate(x, response = "Grad.Rate"), clean_data)
  # Append the variable name, R^2, adjusted R^2, and RMSE to the data frame
  RRMSE_df <- rbind(RRMSE_df, i = c(x, summary(model)$r.squared,
                                    summary(model)$adj.r.squared,
                                    RMSE(model$residuals)))
}
# Remove "temp" record from dataframe initialization
RRMSE_df <- RRMSE_df[-1,]
# Apply column names to new dataframe
colnames(RRMSE_df) <- c("Name", "R^2", "Adjusted R^2", "RMSE")
# Round each column's values to 4 decimal places
RRMSE_df$`R^2` <- round(as.numeric(RRMSE_df$`R^2`), 4)
RRMSE_df$`Adjusted R^2` <- round(as.numeric(RRMSE_df$`Adjusted R^2`), 4)
RRMSE_df$RMSE <- round(as.numeric(RRMSE_df$RMSE), 4)
# Reorder R^2 column from least to greatest value
RRMSE_df <- RRMSE_df[order(RRMSE_df$`R^2`), drop = FALSE, ]
# Apply factor levels to prevent ggplot from reordering variable values
RRMSE_df$Name <- factor(RRMSE_df$Name, levels=unique(as.character(RRMSE_df$Name)))
RRMSE_df <- transform(RRMSE_df, Name=reorder(Name, `R^2`))
options(repr.plot.width=14, repr.plot.height=10)
par(mfrow=c(1,1))
# R^2 Graph from Models
ggplot(RRMSE_df, aes(R.2, Name)) +
labs(title = "R^2 of each model with Grad.Rate as the Indepedent Variable",
x = "R^2", y = "Variable Name") +
scale_x_continuous(breaks = round(seq(min(RRMSE_df$R.2), max(RRMSE_df$R.2), by = 0.02), 2)) +
geom_point() +
theme(text = element_text(size = 20))
We are looking for the variable with the highest R^2, and from this graph we can see that Outstate has the highest R^2 at about 0.29.
This is not a good R^2 value, but we need to keep in mind that each of these models used only one variable; further modeling with multiple variables still needs to be done. This was simply to get an idea of how each of the variables fares on its own against Grad.Rate.
options(repr.plot.width=14, repr.plot.height=10)
par(mfrow=c(1,1))
# Reorder RMSE column from least to greatest value
RRMSE_df <- RRMSE_df[order(RRMSE_df$RMSE), drop = FALSE, ]
# Apply factor levels to prevent ggplot from reordering variable values
RRMSE_df$Name <- factor(RRMSE_df$Name, levels=unique(as.character(RRMSE_df$Name)))
RRMSE_df <- transform(RRMSE_df, Name=reorder(Name, RMSE))
# RMSE graph, computed from each model's residuals
ggplot(RRMSE_df, aes(RMSE, Name)) +
labs(title = "RMSE of each model$residuals with Grad.Rate as the Indepedent Variable",
x = "RMSE", y = "Variable Name") +
scale_x_continuous(breaks = round(seq(min(RRMSE_df$RMSE), max(RRMSE_df$RMSE), by = 0.2), 1)) +
geom_point() +
theme(text = element_text(size = 20))
We are looking for the variable with the lowest RMSE, and from this graph we can see that Outstate has the lowest at about 13.7. An RMSE value means little on its own; it needs to be compared against the RMSEs of the other models, which is why I graphed those as well, and it is what allows us to determine that Outstate has the lowest RMSE. Ideally we are looking for the model with the highest R^2 and the lowest RMSE, which right now appears to be Grad.Rate ~ Outstate.
Based on my analysis so far, this is not a good model for predicting university graduation rates, and my preliminary conclusion is that, individually, these variables are not good predictors. My next step is to start combining multiple variables into models to see if I can improve those scores before moving on to other machine learning algorithms for further analysis.
The next step is to start building models to find the significant variables and hopefully create better models as a result of what we know so far. I have created four linear models below, and we will look at each of them to see the process I went through to arrive at the one I selected.
clean_lm1 <- lm(Grad.Rate ~ ., data = clean_data)
# Summary of the model
summary(clean_lm1)
We are looking for all of the variables with a p-value less than 0.05, so to make them easier to see I have filtered out all the variables that do not meet that criterion.
This list shows us all the variables that are significant in this model; in the next model we will use only these variables, then repeat the process until we have a desired model.
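As an aside, this kind of backward elimination can also be automated. Below is a minimal sketch using base R's step(), which drops terms by AIC rather than by p-values, so it may settle on a different model than the manual process that follows.
# Automated backward elimination on AIC (trace = 0 suppresses the step-by-step log)
auto_lm <- step(lm(Grad.Rate ~ ., data = clean_data), direction = "backward", trace = 0)
summary(auto_lm)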
p_values <- data.frame(cbind("P.value" = summary(clean_lm1)$coef[summary(clean_lm1)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
clean_lm2 <- lm(Grad.Rate ~ perc.alumni + Expend + Outstate + Room.Board + Top25perc +
Apps + Private + Personal + PhD + P.Undergrad, clean_data)
summary(clean_lm2)
p_values <- data.frame(cbind("P.value" = summary(clean_lm2)$coef[summary(clean_lm2)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
We can see from this model's new p-value list that PhD is no longer significant. As a result, we can drop it from the next model. We can check the improvement of each of these models using anova().
anova(clean_lm1, clean_lm2)
Ideally we want the Pr(>F) value to be less than 0.05, so while the model improved, it did not improve enough to reject the null hypothesis and be significant.
clean_lm3 <- lm(Grad.Rate ~ perc.alumni + Expend + Outstate + Room.Board +
Top25perc + Apps + Private + Personal + P.Undergrad, clean_data)
summary(clean_lm3)
p_values <- data.frame(cbind("P.value" = summary(clean_lm3)$coef[summary(clean_lm3)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
anova(clean_lm1, clean_lm2, clean_lm3)
From the anova() output, the third model improved considerably over the second, but the Pr(>F) value is still too high to reject the null hypothesis. So, going back to the third model's p-value list, we will drop the variables with the largest p-values.
To speed things up, I will drop the Room.Board, Personal, and P.Undergrad variables.
clean_lm4 <- lm(Grad.Rate ~ perc.alumni + Expend + Outstate + Top25perc + Apps + Private, clean_data)
summary(clean_lm4)
p_values <- data.frame(cbind("P.value" = summary(clean_lm4)$coef[summary(clean_lm4)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
anova(clean_lm1, clean_lm2, clean_lm3, clean_lm4)
We can see from the anova() output that the fourth model improved significantly over the previous models, with a Pr(>F) value well below 0.05. With this value we can say that the model rejects the null hypothesis. Based on all the information I have been able to deduce from my data, this is the best model I can come up with, and it will be used for the remainder of the analysis.
options(repr.plot.width=16, repr.plot.height=10)
models <- paste0("clean_lm", 1:4)
models_df <- as.data.frame(cbind("Name" = as.character(models),
"R^2" = sapply(models, function(x)round(summary(get(x))$r.squared, 4)),
"RMSE" = sapply(models, function(x)round(RMSE(get(x)$residuals), 4))))
# Reorder R^2 column from least to greatest value
models_df <- models_df[order(models_df$`R^2`), drop = FALSE, ]
# Apply factor levels to prevent ggplot from reordering variable values
models_df$Name <- factor(models_df$Name, levels=unique(as.character(models_df$Name)))
models_df <- transform(models_df, Name=reorder(Name, as.numeric(`R^2`)))
rsquared_Plot <- ggplot(models_df, aes(R.2, Name)) +
labs(title = "R^2 of each of the Four models",
x = "R^2", y = "Models") +
geom_point(size=3) +
theme(text = element_text(size = 20))
# Reorder RMSE column from greatest to least value
models_df <- models_df[order(models_df$RMSE, decreasing = TRUE), drop = FALSE, ]
# Apply factor levels to prevent ggplot from reordering variable values
models_df$Name <- factor(models_df$Name, levels=unique(as.character(models_df$Name)))
models_df <- transform(models_df, Name=reorder(Name, as.numeric(RMSE)))
# RMSE calculation graph from model$residuals
RMSE_Plot <- ggplot(models_df, aes(RMSE, Name)) +
labs(title = "RMSE of each of the Four models",
x = "RMSE", y = "Models") +
geom_point(size=3) +
theme(text = element_text(size = 20))
grid.arrange(rsquared_Plot, RMSE_Plot)
While the fourth model does not have the largest R^2 or the smallest RMSE compared to the other models, the anova() function tells us that it shows the best improvement over the previous models.
We can also take a look at the models in the diagnostic plots below. The main one we want to look at is the Normal Q-Q plot in the top right corner; we want the grouping of points to be as close to the dotted line as possible.
options(repr.plot.width=16, repr.plot.height=10)
# Function to draw the four diagnostic plots for a given linear model
plot_func <- function(x){
  par(mfrow=c(2,2))
  plot(get(x))
  title(x, line = -2, outer = TRUE)
}
# Plot the diagnostic graphs for each model (base graphics)
invisible(sapply(models, plot_func))
As you look through the diagnostic plots, the grouping of points on the Normal Q-Q plot gets closer and closer to the line, with the last model (clean_lm4) having the points closest to it. All of these plots still show skew in the right-hand tail, but the fourth model is the best fit of them all.
For comparison's sake, I will run the same tests with the first model (using all variables) to compare against the results of the fourth model.
Now that I have the models I want to use, I will apply some machine learning algorithms to start training and testing my data predictions and to find the accuracy of my models.
First I will find the accuracy of my linear models' predictions, and then I will move on to randomForest and k-nearest neighbors (KNN) models.
# The models I will be using
model1 <- (Grad.Rate ~ .)
model4 <- (Grad.Rate ~ perc.alumni + Expend + Outstate + Top25perc + Apps + Private)
# First split the data into train and test sets:
# train will have 80% and test will have 20% of the data
# Set a seed so the partition is reproducible (the seed value is arbitrary)
set.seed(123)
in_train <- createDataPartition(clean_data$Grad.Rate, p = 0.8, list = FALSE)
# Create train and test datasets; Grad.Rate stays numeric since this is a regression problem
train <- clean_data[in_train,]
test <- clean_data[-in_train,]
# Train our first model, lm1
clean_lm1 <- train(model1, method = "lm", data = train)
# Train our fourth model, lm4
clean_lm4 <- train(model4, method = "lm", data = train)
# Add RMSEs and R^2 values to the results data frame
RMSE_data <- rbind("clean_lm1" = cbind(clean_lm1$results["RMSE"], clean_lm1$results["Rsquared"]),
"clean_lm4" = cbind(clean_lm4$results["RMSE"], clean_lm4$results["Rsquared"]))
# clean_lm1's Output
print(clean_lm1)
# clean_lm4's Output
print(clean_lm4)
From the results of just the linear models so far, it looks like the fourth model fared just a bit better than the first model with all of the variables, which is good, because it means that the variables I removed from the first model did in fact improve the fourth model.
From each model's output, we are looking at the RMSE and R^2 values. The RMSE is how far, on average, the predicted value is from the actual value, so the closer this value is to 0 the better. The bigger the R^2 value the better, with 1 being the best, because it means that the independent variables explain more of the variance in the dependent variable. We will need to complete all the other models and compare all the values together before we can draw any conclusions from their results.
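As a sanity check, we can also score the held-out test set directly. Below is a minimal sketch using caret's postResample(), which returns the RMSE, R^2, and MAE of the predictions against the observed values; it assumes the train/test split created above.
# Score the fourth model's predictions on the 20% test split
test_pred <- predict(clean_lm4, newdata = test)
postResample(pred = test_pred, obs = test$Grad.Rate)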
# Train the first randomForest model (caret's default bootstrap resampling)
clean_rf1 <- train(model1, method = "rf", data = train)
# Train the fourth randomForest model
clean_rf4 <- train(model4, method = "rf", data = train)
# Add RMSEs and R^2 values to the results data frame
RMSE_data <- rbind(RMSE_data,
"clean_rf1" = cbind("RMSE" = clean_rf1$results["RMSE"][[1]][1],
"Rsquared" = clean_rf1$results["Rsquared"][[1]][1]),
"clean_rf4" = cbind("RMSE" = clean_rf4$results["RMSE"][[1]][1],
"Rsquared" = clean_rf4$results["Rsquared"][[1]][1]))
# clean_rf1's Output
print(clean_rf1)
# clean_rf4's Output
print(clean_rf4)
The randomForest model works by tuning mtry, the number of predictors sampled at each split, to get the best RMSE and R^2. From the model output we see three candidate values under the resampling results. Out of all of them, the first has the best RMSE and R^2, so it will be recorded and used to measure this model's effectiveness against the other models.
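If we wanted more control over this tuning, caret accepts an explicit tuning grid and resampling scheme. Below is a minimal sketch with illustrative mtry values and 5-fold cross-validation; this is not what was run above, just an example of how the tuning could be directed.
# Tune mtry explicitly over a small grid with 5-fold cross-validation
clean_rf_cv <- train(model4, method = "rf", data = train,
                     trControl = trainControl(method = "cv", number = 5),
                     tuneGrid = expand.grid(mtry = c(2, 4, 6)))
# Extract the row caret selected as best
clean_rf_cv$results[clean_rf_cv$results$mtry == clean_rf_cv$bestTune$mtry, ]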
# Train the first knn model (again with caret's default resampling)
clean_knn1 <- train(model1, method = "knn", data = train)
# Train the fourth knn model
clean_knn4 <- train(model4, method = "knn", data = train)
# Add RMSEs and R^2 values to the results data frame
RMSE_data <- rbind(RMSE_data,
"clean_knn1" = cbind("RMSE" = clean_knn1$results["RMSE"][[1]][1],
"Rsquared" = clean_knn1$results["Rsquared"][[1]][1]),
"clean_knn4" = cbind("RMSE" = clean_knn4$results["RMSE"][[1]][1],
"Rsquared" = clean_knn4$results["Rsquared"][[1]][1]))
# clean_knn1's Output
print(clean_knn1)
# clean_knn4's Output
print(clean_knn4)
The KNN model works by finding the number of neighbors (k) that gives the best RMSE and R^2 for the data. From the model output we see three candidate values of k under the resampling results. Out of all of them, the third has the best RMSE and R^2, so it will be recorded and used to measure this model's effectiveness against the other models.
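Because KNN is distance-based, it usually benefits from centering and scaling the predictors, which caret can do inline. Below is a minimal sketch with an illustrative grid of k values; again, this is not what was run above.
# KNN with standardized predictors and an explicit grid of k values
clean_knn_scaled <- train(model4, method = "knn", data = train,
                          preProcess = c("center", "scale"),
                          tuneGrid = data.frame(k = c(5, 7, 9, 11)))
print(clean_knn_scaled)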
With the analysis finished, we can move on to answering the research questions and drawing final conclusions about this dataset.
options(repr.plot.width=16, repr.plot.height=6)
par(mfrow=c(1,1))
for (i in c(1:ncol(RMSE_data))){
  plot(RMSE_data[[i]], main = paste(colnames(RMSE_data[i]), " of all of the models", sep=""),
       xaxt = "n", ylab = colnames(RMSE_data[i]),
       xlab = "Models", col="lightblue", pch=19, cex=6)
  axis(1, at=1:6, labels=rownames(RMSE_data))
  # Vertical lines separate the linear, randomForest, and KNN model pairs
  abline(v=c(2.5, 4.5))
  # Add values to points for easier viewing
  text(RMSE_data[[i]], labels=round(RMSE_data[[i]], 3), cex=0.8, font=2)
}
Overall, looking at the graphs above, the model with the lowest RMSE and the highest R^2 is the first randomForest model (clean_rf1). The randomForest models fared about the same as (just slightly better than) the linear models, and, curiously, the KNN models fared far worse than both. However, the R^2 is still very low, and the RMSE means the model can only predict a university's Grad.Rate to within roughly that margin. In terms of a graduation rate, that is quite a large margin, so this model should probably not be used to predict a university's graduation rate.
First Model P-Values
# First linear model with all variables
p_values <- data.frame(cbind("P.value" = summary(clean_lm1)$coef[summary(clean_lm1)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
Fourth Model P-Values
# Fourth linear model
p_values <- data.frame(cbind("P.value" = summary(clean_lm4)$coef[summary(clean_lm4)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
From the lists above we can see that for the first model the variables perc.alumni, Expend, Outstate, Top25perc, Private, Personal, Room.Board, and Apps were all significant.
Since the fourth model performed better than the first one, I am inclined to say that the variables listed as significant in the fourth model have a bigger impact on the data than those in the first: perc.alumni, Top25perc, Outstate, Apps, Expend, and Private.
options(repr.plot.width=16, repr.plot.height=10)
par(mfrow=c(2,2))
for(i in c(30, 50, 70, 90)){
  rate <- clean_data %>% rownames_to_column('uni') %>%
    filter(Grad.Rate >= i) %>%
    column_to_rownames('uni')
  tempHist <- hist(rate$Grad.Rate, main = paste("Grad.Rate >= ", i, "% (Total: ", nrow(rate), ")", sep=""),
                   xlab = "Graduation Rate (%)")
  text(tempHist$mids, tempHist$counts, labels=tempHist$counts, adj=c(0.5, -0.5))
}
# Copy dataframe
newPred <- clean_data
# Set the PhD variable (percent of faculty with PhDs) to 120 for all universities
newPred$PhD <- 120
# Make predictions using clean_lm4 because it had the best RMSE and R^2
pred <- as.data.frame(cbind("Grad.Rate" = round(predict(clean_lm4, newdata = newPred), 0)))
# Total Universities
print(paste0('Total Universities: ', sum(table(pred$Grad.Rate > clean_data$Grad.Rate))))
# Comparison of the grad rates
print('Comparison of Grad Rates')
table(pred$Grad.Rate > clean_data$Grad.Rate)
From the comparison we can see a fairly even split between universities whose predicted graduation rate increased and those whose rate decreased. However, it is important to note that PhD had a very small impact on the models during our analysis, and running the same prediction with the unmodified clean dataset gives the same results. As a result, with this model I am leaning towards the conclusion that the percentage of professors with PhDs does not have a significant influence on the graduation rate.
# Get list of private universities for prediction and clean datasets
privated <- clean_data %>% filter(Private == 2)
clean_priv <- clean_data %>% filter(Private == 2)
# Get list of public universities for prediction and clean datasets
public <- clean_data %>% filter(Private == 1)
clean_publ <- clean_data %>% filter(Private == 1)
# Change private schools to public
privated$Private <- 1
# Change public schools to private
public$Private <- 2
# Make predictions using clean_lm4 because it had the best RMSE and R^2
# Private to Public prediction
pred_Priv <- as.data.frame(cbind("Grad.Rate" = round(predict(clean_lm4, newdata = privated), 0)))
# Public to Private Prediction
pred_Publ <- as.data.frame(cbind("Grad.Rate" = round(predict(clean_lm4, newdata = public), 0)))
# Comparison of grad rates to original dataset
# Total Universities that were originally private
paste0('Total Universities that were originally private: ', sum(table(pred_Priv$Grad.Rate > clean_priv$Grad.Rate)))
# Private to Public comparison
print('Private to Public Comparison')
table(pred_Priv$Grad.Rate > clean_priv$Grad.Rate)
# Total Universities that were originally public
print('Total universities that were originally public')
sum(table(pred_Publ$Grad.Rate > clean_publ$Grad.Rate))
# Public to Private comparison
print('Public to Private Comparison')
table(pred_Publ$Grad.Rate > clean_publ$Grad.Rate)
From the first comparison (Private to Public) of the 516 originally private universities, about 32% saw an increase in their Grad.Rate and the rest (about 68%) either saw a decrease or stayed the same. Based on this, I would say that if a private university decided to become public, it could be assumed that it would experience a decrease in its graduation rate.
From the second comparison (Public to Private) of the 169 originally public universities, about 67% saw an increase in their Grad.Rate and the rest (about 33%) either saw a decrease or stayed the same. Based on this, I would say that if a public university decided to become private, it could be assumed that it would experience an increase in its graduation rate.
However, we still need to keep in mind that these results are based on a model with a wide error margin and should not be taken too confidently.
In conclusion, based on all of the analysis, I am confident in saying that this dataset does not have enough data, or the right variables measured, to predict the graduation rate of these universities to a confident degree. The models' RMSE and R^2 values are not good enough to use them for predicting university graduation rates. I would recommend that more data be collected, and that other variables be measured as well, to get a better spread of data and more accurate models.
This could mean that the variables measured in this dataset do not actually have an effect on the graduation rate, and thus university administrations do not need to spend more time and/or resources improving them to boost their graduation rates. However, as I said above, I believe more data, and possibly more variable measurements, need to be collected before agreeing with that statement confidently.
This dataset is limited to a single year of data collection, and I think it would benefit greatly from multiple years of data from these universities; a far more accurate model could then be trained to predict their graduation rates. I also think that one way to improve this dataset, and by extension the models derived from it, would be to collect other data about each university as well, such as the number of students who graduate, the number of students who drop out in each school year (freshman, sophomore, junior, senior), and possibly student demographic data (to get a better idea of the makeup of the student population).